Spatial Temporal Transformer Network for Skeleton-based Action Recognition
Skeleton-based human action recognition has attracted great interest in
recent years, as skeleton data have been shown to be robust to
illumination changes, body scales, dynamic camera views, and complex
backgrounds. Nevertheless, an effective encoding of the latent information
underlying the 3D skeleton is still an open problem. In this work, we propose a
novel Spatial-Temporal Transformer network (ST-TR) which models dependencies
between joints using the Transformer self-attention operator. In our ST-TR
model, a Spatial Self-Attention module (SSA) is used to understand intra-frame
interactions between different body parts, and a Temporal Self-Attention module
(TSA) to model inter-frame correlations. The two are combined in a two-stream
network which outperforms state-of-the-art models using the same input data on
both NTU-RGB+D 60 and NTU-RGB+D 120.
Comment: Accepted at ICPRW2020 (FBE2020, Workshop on Facial and Body Expressions, micro-expressions and behavior recognition), 8 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:2008.0740
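The Spatial Self-Attention (SSA) module described above relates the joints within a single frame through the Transformer self-attention operator. The following is a minimal NumPy sketch of single-head self-attention across joints; the shapes, projection matrices, and head count are illustrative and not the paper's exact configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(joints, Wq, Wk, Wv):
    """Single-head self-attention across the J joints of one frame.

    joints: (J, C) per-joint features; Wq/Wk/Wv: (C, D) learned projections.
    Each output joint is a weighted sum of all joints' values, with weights
    given by query-key affinities (the standard scaled dot-product form).
    """
    q, k, v = joints @ Wq, joints @ Wk, joints @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (J, J) joint-to-joint affinities
    return softmax(scores, axis=-1) @ v       # (J, D) attended features
```

Unlike the fixed adjacency of a skeleton graph, the (J, J) score matrix lets any joint attend to any other, which is the intra-frame flexibility the abstract emphasises.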
What can a cook in Italy teach a mechanic in India? Action Recognition Generalisation Over Scenarios and Locations
We propose and address a new generalisation problem: can a model trained for
action recognition successfully classify actions when they are performed within
a previously unseen scenario and in a previously unseen location? To answer
this question, we introduce the Action Recognition Generalisation Over
scenarios and locations dataset (ARGO1M), which contains 1.1M video clips from
the large-scale Ego4D dataset, across 10 scenarios and 13 locations. We
demonstrate recognition models struggle to generalise over 10 proposed test
splits, each of an unseen scenario in an unseen location. We thus propose CIR,
a method to represent each video as a Cross-Instance Reconstruction of videos
from other domains. Reconstructions are paired with text narrations to guide
the learning of a domain generalisable representation. We provide extensive
analysis and ablations on ARGO1M that show CIR outperforms prior domain
generalisation works on all test splits. Code and data:
https://chiaraplizz.github.io/what-can-a-cook/.
Comment: Accepted at ICCV 2023. Project page: https://chiaraplizz.github.io/what-can-a-cook
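The core idea of CIR is to represent each video as a reconstruction built from videos of other domains. A much-simplified NumPy sketch of that cross-instance step, using softmax attention over a support set of other-domain features (the temperature and the single-query formulation are assumptions for illustration, not the paper's full method):

```python
import numpy as np

def cross_instance_reconstruction(query, support, temperature=0.1):
    """Reconstruct one clip's feature as a convex combination of clip
    features drawn from OTHER scenarios/locations.

    query: (D,) feature of the clip being reconstructed.
    support: (N, D) features from different domains.
    Returns the reconstruction (D,) and the attention weights (N,).
    """
    sims = support @ query / temperature      # similarity logits, (N,)
    w = np.exp(sims - sims.max())
    w = w / w.sum()                           # softmax weights over support
    return w @ support, w                     # weighted sum of other-domain clips
```

In the paper these reconstructions are additionally paired with text narrations to guide learning; that supervision is omitted here.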
Domain generalization through audio-visual relative norm alignment in first person action recognition
First person action recognition is becoming an increasingly researched area thanks to the rising popularity of wearable cameras. This is bringing to light cross-domain issues that are yet to be addressed in this context. Indeed, the information extracted from learned representations suffers from an intrinsic "environmental bias". This strongly affects the ability to generalize to unseen scenarios, limiting the application of current methods to real settings where labeled data are not available during training. In this work, we introduce the first domain generalization approach for egocentric activity recognition, by proposing a new audio-visual loss, called Relative Norm Alignment loss. It rebalances the contributions of the two modalities during training, over different domains, by aligning their feature norm representations. Our approach leads to strong results in domain generalization on both EPIC-Kitchens-55 and EPIC-Kitchens-100, as demonstrated by extensive experiments, and can also be extended to domain adaptation settings with competitive results.
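The abstract describes Relative Norm Alignment as rebalancing the two modalities by aligning their feature norms. A minimal sketch of one plausible form of such a loss, penalising the imbalance between the mean audio and visual feature norms; this is an assumption based on the abstract's description, and the paper's exact formulation may differ:

```python
import numpy as np

def relative_norm_alignment_loss(feat_audio, feat_visual):
    """Penalise the imbalance between the mean L2 feature norms of the
    audio and visual streams (sketch of the idea in the abstract).

    feat_audio: (B, Da) audio features, feat_visual: (B, Dv) visual features.
    The loss is zero when the two mean norms match, pushing training to
    keep both modalities' contributions comparable across domains.
    """
    na = np.linalg.norm(feat_audio, axis=1).mean()
    nv = np.linalg.norm(feat_visual, axis=1).mean()
    return (na / nv - 1.0) ** 2
```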
PoliTO-IIT Submission to the EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition
In this report, we describe the technical details of our submission to the
EPIC-Kitchens-100 Unsupervised Domain Adaptation (UDA) Challenge in Action
Recognition. To tackle the domain-shift which exists under the UDA setting, we
first exploited a recent Domain Generalization (DG) technique, called Relative
Norm Alignment (RNA). It consists of designing a model able to generalize well
to any unseen domain, regardless of whether target data are accessible at
training time. Then, in a second phase, we extended the approach to work on
unlabelled target data, allowing the model to adapt to the target distribution
in an unsupervised fashion. For this purpose, we included in our framework
existing UDA algorithms, such as Temporal Attentive Adversarial Adaptation
Network (TA3N), jointly with new multi-stream consistency losses, namely
Temporal Hard Norm Alignment (T-HNA) and Min-Entropy Consistency (MEC). Our
submission (entry 'plnet') is visible on the leaderboard and it achieved the
1st position for 'verb', and the 3rd position for both 'noun' and 'action'.
Comment: 3rd place in the 2021 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition
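Among the losses named above, Min-Entropy Consistency (MEC) builds on the standard idea of entropy minimisation over unlabelled target predictions. The sketch below shows only that generic building block, not the report's multi-stream consistency formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mean_prediction_entropy(logits):
    """Mean Shannon entropy of the softmax predictions over a batch of
    unlabelled target clips. Minimising this pushes the classifier
    towards confident predictions on the target domain -- the usual
    core of entropy-based adaptation losses (a sketch, not the exact
    MEC loss used in the submission).

    logits: (B, K) classifier outputs.
    """
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=1).mean()
```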
EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge: Mixed Sequences Prediction
This report presents the technical details of our approach for the
EPIC-Kitchens-100 Unsupervised Domain Adaptation (UDA) Challenge in Action
Recognition. Our approach is based on the idea that the order in which actions
are performed is similar between the source and target domains. Based on this,
we generate a modified sequence by randomly combining actions from the source
and target domains. As only unlabelled target data are available under the UDA
setting, we use a standard pseudo-labeling strategy for extracting action
labels for the target. We then ask the network to predict the resulting action
sequence. This allows the model to integrate information from both domains
during training and to achieve better transfer results on the target. Additionally, to
better incorporate sequence information, we use a language model to filter
unlikely sequences. Lastly, we employed a co-occurrence matrix to eliminate
unseen combinations of verbs and nouns. Our submission, labeled as 'sshayan',
can be found on the leaderboard, where it currently holds the 2nd position for
'verb' and the 4th position for both 'noun' and 'action'.
Comment: 2nd place in the 2023 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition
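The mixing step described above, in which a modified sequence is generated by randomly combining source actions with pseudo-labelled target actions, can be sketched as follows. The function and label names are illustrative; the report's actual pipeline also involves pseudo-label extraction, a language-model filter, and a verb-noun co-occurrence matrix, none of which are shown here:

```python
import numpy as np

def mix_action_sequences(source_labels, target_pseudo_labels, rng):
    """Randomly interleave a source action sequence (ground-truth labels)
    with a target sequence (pseudo-labels), preserving each domain's
    internal action order -- a sketch of the report's assumption that
    the order of actions transfers across domains.

    Returns the mixed label sequence and, per position, which domain
    the action came from.
    """
    seq, domains = [], []
    i = j = 0
    while i < len(source_labels) or j < len(target_pseudo_labels):
        take_source = j >= len(target_pseudo_labels) or (
            i < len(source_labels) and rng.random() < 0.5)
        if take_source:
            seq.append(source_labels[i]); domains.append("source"); i += 1
        else:
            seq.append(target_pseudo_labels[j]); domains.append("target"); j += 1
    return seq, domains
```

The network is then asked to predict the labels of this mixed sequence, forcing it to use information from both domains at once.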
An Outlook into the Future of Egocentric Vision
What will the future be? We wonder! In this survey, we explore the gap
between current research in egocentric vision and the ever-anticipated future,
where wearable computing, with outward facing cameras and digital overlays, is
expected to be integrated in our every day lives. To understand this gap, the
article starts by envisaging the future through character-based stories,
showcasing through examples the limitations of current technology. We then
provide a mapping between this future and previously defined research tasks.
For each task, we survey its seminal works, current state-of-the-art
methodologies and available datasets, then reflect on shortcomings that limit
its applicability to future research. Note that this survey focuses on software
models for egocentric vision, independent of any specific hardware. The paper
concludes with recommendations for areas of immediate exploration so as to
unlock our path to the future always-on, personalised and life-enhancing
egocentric vision.
Comment: We invite comments, suggestions and corrections here: https://openreview.net/forum?id=V3974SUk1
Skeleton-based action recognition via spatial and temporal transformer networks
Skeleton-based Human Activity Recognition has attracted great interest in
recent years, as skeleton data have been shown to be robust to illumination
changes, body scales, dynamic camera views, and complex backgrounds. In
particular, Spatial-Temporal Graph Convolutional Networks (ST-GCN) have proven
effective in learning both spatial and temporal dependencies on
non-Euclidean data such as skeleton graphs. Nevertheless, an effective encoding
of the latent information underlying the 3D skeleton is still an open problem,
especially when it comes to extracting effective information from joint motion
patterns and their correlations. In this work, we propose a novel
Spatial-Temporal Transformer network (ST-TR) which models dependencies between
joints using the Transformer self-attention operator. In our ST-TR model, a
Spatial Self-Attention module (SSA) is used to understand intra-frame
interactions between different body parts, and a Temporal Self-Attention module
(TSA) to model inter-frame correlations. The two are combined in a two-stream
network, whose performance is evaluated on three large-scale datasets,
NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics Skeleton 400, consistently improving
backbone results. Compared with methods that use the same input data, the
proposed ST-TR achieves state-of-the-art performance on all datasets when using
joints' coordinates as input, and results on par with the state of the art when
adding bone information.
Comment: Accepted at Computer Vision and Image Understanding (CVIU), 12 pages, 8 figures.
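Complementing the spatial module, the Temporal Self-Attention (TSA) module described above models inter-frame correlations by attending across time. A minimal NumPy sketch of single-head attention over the T frames of a clip, computed independently for each joint; as before, shapes and projections are illustrative rather than the paper's exact configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(clip, Wq, Wk, Wv):
    """Self-attention across the T frames of a clip, per joint.

    clip: (T, J, C) features for T frames and J joints; Wq/Wk/Wv: (C, D).
    Joints are moved to the batch axis, so each joint's trajectory
    attends over all time steps independently of the other joints.
    Returns (T, J, D).
    """
    q, k, v = clip @ Wq, clip @ Wk, clip @ Wv            # (T, J, D)
    q, k, v = (np.transpose(a, (1, 0, 2)) for a in (q, k, v))  # (J, T, D)
    scores = q @ np.transpose(k, (0, 2, 1)) / np.sqrt(k.shape[-1])  # (J, T, T)
    out = softmax(scores, axis=-1) @ v                   # (J, T, D)
    return np.transpose(out, (1, 0, 2))                  # back to (T, J, D)
```

In the two-stream ST-TR design, the outputs of spatial and temporal streams like this one are combined before classification.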